SVMs for the Blogosphere: Blog Identification and Splog Detection
Authors
Abstract
Weblogs, or blogs, have become an important new way to publish information, engage in discussions, and form communities. The increasing popularity of blogs has given rise to search and analysis engines focusing on the “blogosphere”. A key requirement of such systems is to identify blogs as they crawl the Web. While this ensures that only blogs are indexed, blog search engines are also often overwhelmed by spam blogs (splogs). Splogs not only incur computational overhead but also reduce user satisfaction. In this paper we first describe experimental results of blog identification using Support Vector Machines (SVMs). We compare results of using different feature sets and introduce new features for blog identification. We then report preliminary results on splog detection and identify future work.
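The abstract does not say which features or SVM implementation the paper uses. As a rough illustration only, the sketch below assumes binary bag-of-words features and a linear SVM trained by subgradient descent on the regularized hinge loss; the toy pages, the hyperparameters, and every function name are hypothetical and are not taken from the paper.

```python
# Minimal sketch of blog vs. non-blog classification with a linear SVM.
# Assumptions (not from the paper): binary bag-of-words features,
# hinge-loss subgradient training, and the toy pages below.

def featurize(text):
    """Binary bag-of-words: the set of lowercase tokens in the text."""
    return set(text.lower().split())

def score(weights, bias, features):
    """Linear decision value w . x + b for a binary feature set."""
    return sum(weights.get(tok, 0.0) for tok in features) + bias

def train_linear_svm(examples, epochs=30, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss.

    examples: list of (feature_set, label) pairs with label in {+1, -1}.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            margin = label * score(weights, bias, features)
            # Regularization step: shrink every weight slightly.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge-loss step: update only when the margin is violated.
            if margin < 1.0:
                for tok in features:
                    weights[tok] = weights.get(tok, 0.0) + lr * label
                bias += lr * label
    return weights, bias

def predict(weights, bias, text):
    """Return +1 (blog) or -1 (non-blog)."""
    return 1 if score(weights, bias, featurize(text)) > 0 else -1

# Hypothetical toy training pages: +1 = blog, -1 = non-blog.
TRAIN = [
    ("posted by admin comments trackback permalink", 1),
    ("archives comments rss feed permalink", 1),
    ("products pricing contact sales", -1),
    ("company careers contact investors", -1),
]

examples = [(featurize(text), label) for text, label in TRAIN]
weights, bias = train_linear_svm(examples)

print(predict(weights, bias, "comments permalink rss"))
print(predict(weights, bias, "pricing contact careers"))
```

Comparing feature sets, as the abstract describes, would amount to swapping `featurize` for another extractor (for example, one over URLs or anchor text, again as an assumption) and re-running the same training and evaluation.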
Related Resources
The Splog Detection Task and A Solution Based on Temporal and Link Properties
Spam blogs (splogs) have become a major problem in the increasingly popular blogosphere. Splogs are detrimental in that they corrupt the quality of information retrieved and they waste tremendous network and storage resources. We study several research issues in splog detection. First, in comparison to web spam and email spam, we identify some unique characteristics of splogs. Second, we propose...
Towards Spam Detection at Ping Servers
Spam blogs, or splogs, feature plagiarized or auto-generated content. They create link farms to promote affiliates, and are motivated by the profitability of hosting ads. Splogs infiltrate the blogosphere at ping servers, systems that aggregate blog update pings. Over the past year, our work has focused on detecting and eliminating splogs. As techniques used by spammers have evolved, we have lea...
Blog Track Open Task: Spam Blog Classification
Spam blogs, or splogs, are blogs with either auto-generated or plagiarized content created for the sole purpose of hosting ads, promoting affiliate sites and getting new pages indexed. Splogs now rival generic web spam and e-mail spam, presenting a major problem to analytics on the blogosphere from basic search and indexing, to opinion, community, influence and correlation detection. This open ta...
Overview of the TREC-2010 Blog Track
• Top stories identification: A task that addresses news-related issues on the blogosphere, namely investigating whether the blogosphere can be leveraged to identify the top news stories of a given day in a real-time fashion. The task also has a search diversity flavour, where for a given story, a representative set of blog posts discussing the story from various perspectives [7] is shown to th...
Identifying and Ranking Topic Clusters in the Blogosphere
The blogosphere is a huge collaboratively constructed resource containing diverse and rich information. This diversity and richness presents a significant research challenge to the Information Retrieval community. This paper addresses this challenge by proposing a method for identification of “topic clusters” within the blogosphere where topic clusters represent the concept of grouping together...